Record Matching: Improving Performance in Classification

Authors

  • Cyju Elizabeth Varghese
  • Naveen Sundar
Abstract

Duplication detection identifies records that represent the same real-world entity, and it is a vital process in data integration. Record matching refers to the task of finding entries that refer to the same entity in two or more files. Because record matching solves the duplication detection problem, a suitable record matching technique has to be identified. Current duplication detection techniques are supervised and require the user to provide training data. These methods are not applicable to the Web database scenario, where the records to match are query results generated dynamically on the fly. To address record matching in this scenario, we present Fast Duplication Detection (FDD), which, for a given query, can effectively identify duplicates among the query result records of multiple Web databases. Starting from the non-duplicate set, we use two classifiers, a dynamic classification classifier and an SVM classifier, to iteratively identify duplicates in the query results from multiple Web databases. Performing clustering before passing the vectors to the classifiers should produce a better result. Moreover, a nonlinear SVM produces better results on noisy documents, which improves the overall performance of the system. Experimental results show that FDD performs better in the Web database scenario.

Keywords: Record Matching; Duplication Detection; SVM; Unsupervised
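The pairwise comparison that record matching relies on can be illustrated with a minimal sketch. It is not the FDD method itself: the abstract does not specify the classifiers, so a simple weighted-sum threshold stands in for them here, and the clustering and SVM steps are omitted. The function names, field names, weights, threshold, and sample records are all hypothetical.

```python
from difflib import SequenceMatcher
from itertools import product


def field_similarity(a, b):
    """Similarity in [0, 1] between two field values, ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def similarity_vector(r1, r2, fields):
    """One similarity score per compared field for a pair of records."""
    return [field_similarity(r1.get(f, ""), r2.get(f, "")) for f in fields]


def find_duplicates(source_a, source_b, fields, weights, threshold=0.85):
    """Return index pairs (i, j) whose weighted similarity reaches the threshold.

    Assumption: a fixed weighted sum replaces the classifiers
    mentioned in the abstract, which are not described there.
    """
    total = sum(weights)
    matches = []
    for (i, r1), (j, r2) in product(enumerate(source_a), enumerate(source_b)):
        vec = similarity_vector(r1, r2, fields)
        score = sum(w * s for w, s in zip(weights, vec)) / total
        if score >= threshold:
            matches.append((i, j))
    return matches


# Hypothetical query results returned by two Web databases.
db_a = [
    {"title": "Data Integration Basics", "author": "J. Smith"},
    {"title": "Web Mining", "author": "A. Jones"},
]
db_b = [
    {"title": "data integration basics", "author": "J. Smith"},
    {"title": "Graph Theory", "author": "B. Lee"},
]

pairs = find_duplicates(db_a, db_b, ["title", "author"], weights=[0.7, 0.3])
print(pairs)
```

In an unsupervised setting such as the one the abstract describes, the fixed weights and threshold would presumably be replaced by values learned iteratively from the query results rather than set by hand.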

Similar resources

Adaptive Approximate Record Matching

Typographical data entry errors and incomplete documents produce imperfect records in real-world databases. These errors generate distinct records that belong to the same entity. The aim of Approximate Record Matching is to find multiple records that belong to the same entity. In this paper, an algorithm for Approximate Record Matching is proposed that can be adapted automatically with input error...

A Semi-Analytical Method for History Matching and Improving Geological Models of Layered Reservoirs: CGM Analytical Method

History matching is used to constrain flow simulations and reduce uncertainty in forecasts. In this work, we revisited some fundamental engineering tools for predicting waterflooding behavior to better understand the flaws in our simulation and thus find some models which are more accurate with better matching. The Craig-Geffen-Morse (CGM) analytical method was used to predict recovery performa...

The Effect of Inflation Targeting on Indirect Tax Performance in Selected Countries Using Propensity Score Matching Model

The inflation targeting framework has become a predominant monetary approach across the globe. Williams (2015) believes that, in a very real sense, almost all economies are now inflation targeters, either explicit or implicit (1). Due to the increasing spread of this policy, it is necessary to consider the way it affects macroeconomic variables. Using prevalent economic models for evaluating the eff...

The Effect of Feedback and Incentive Mechanisms on Improving Residents’ Medical Record Documentation Procedure

Introduction: Studies indicate that using behavior-changing interventions may improve medical record documentation. This study aimed to examine the effect of feedback and incentive mechanisms on medical record documentation among surgery residents in Kashan University of Medical Sciences. Methods: This quasi-experimental study examined the effect of feedback and incentive mechanisms on 19 surge...

Dimensionality Reduction and Improving the Performance of Automatic Modulation Classification using Genetic Programming (RESEARCH NOTE)

This paper shows how genetic programming can be used to select suitable features for automatic modulation recognition. Automatic modulation recognition is one of the essential components of modern receivers. In this regard, the selection of suitable features may significantly affect the performance of the process. Simulations were conducted at SNRs of 5 dB and 10 dB. Test and ...

Publication date: 2011